Fixed-point Design of the
Lattice-reduction-aided Iterative Detection and
Decoding Receiver for Coded MIMO Systems

Qingsong  Wen,  Minzhen  Ren,  Xiaoli  Ma 
School  of  Electrical  and  Computer  Engineering, 

Georgia  Institute  of  Technology, Atlanta,  Georgia  30332 


I.  Introduction 

With the evolution of wireless communication systems, the multiple-input multiple-output
(MIMO) system has been adopted to provide higher data rates [1]. In addition, error control
codes (ECC), e.g., Turbo codes [2] and low-density parity-check (LDPC) codes [3], are usually
included in the system to enhance information reliability. The challenge in applying both
MIMO and ECC to wireless systems lies in designing a reliable but low-complexity receiver.

The optimal receiver for coded MIMO systems uses a joint detector and decoder for
the whole coded data block, which is extremely complex and infeasible in practical systems
because of the long coded data block. Although decoupling the detector and the decoder
significantly reduces the complexity, the performance is largely degraded compared to
the optimal receiver. To balance complexity and performance, a receiver with
iterative detection and decoding (IDD) is proposed in [4], where a separate soft-input
soft-output (SISO) detector and SISO decoder achieve near-optimal performance by
exchanging extrinsic information iteratively.

The optimal SISO detector under IDD for coded MIMO systems is the maximum a
posteriori (MAP) detector, whose complexity is high, especially when the constellation
size and/or the channel dimension are large. List MIMO detectors, such as the list sphere
detector [4] and the list sequential detector [5], are an attractive choice as they allow a flexible
tradeoff between performance and complexity. One key issue of a list MIMO detector is to
generate, with low complexity, a list of candidates that contains the transmitted symbol vector.
The way the list is found and the number of candidates in it directly affect both
performance and complexity. It is therefore desirable for the detector to achieve near-optimal
performance using only a small number of candidates.

Recently, the lattice reduction (LR) technique has been proposed in [6], [7], and [8] to improve
the performance of MIMO detectors by transforming the channel matrix into a better-conditioned
matrix. It is shown that LR-aided linear detectors can achieve the full diversity of the maximum
likelihood (ML) receiver. Furthermore, combining LR with list MIMO detection such as the
K-best detector [9] maintains near-ML performance even with very low K
values (the number of candidates), which means much lower detector complexity. The
LR-aided IDD algorithms with list MIMO detectors have been well studied in the literature [10].
However, few papers focus on the fixed-point design of the whole LR-aided IDD
system, which is a key step for practical hardware implementation in VLSI chips or FPGAs.
In this paper, we evaluate the LR-aided IDD performance under finite precision in operands
and arithmetic operations, and design a detailed fixed-point implementation of the whole
LR-aided IDD receiver such that the bit error rate (BER) performance of the fixed-point
system stays within 0.2 dB of the performance of the corresponding
floating-point system.

The  rest  of  this  paper  is  organized  as  follows.  Section  II  presents  the  system  model  of  the 
LR-aided  IDD  receiver  for  MIMO  coded  systems.  Section  III  introduces  the  key  algorithms 
used  in  the  fixed-point  LR-aided  IDD  receiver.  Section  IV  provides  the  detailed  fixed-point 
implementation  for  the  whole  LR-aided  IDD  receiver  followed  by  the  conclusion  in  Section  V. 

II.  System  Model 

Consider the coded multiplexing transmission system depicted in Fig. 1. At the transmitter, a
sequence of binary information bits b is randomly generated, encoded by the ECC, and interleaved.
The coded sequence c is then mapped into a symbol sequence s, where each constellation symbol
carries k bits. For a system with N transmit and M receive antennas, the MIMO transmission
can be expressed as:


$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{w}, \qquad (1)$$

where $\mathbf{H}$ is assumed to be an $M \times N$ complex Gaussian channel matrix whose entries have
zero mean and unit variance, the $N \times 1$ vector $\mathbf{s}$ consists of the information symbols
drawn from a constellation $\mathcal{S}$, $\mathbf{y}$ is the $M \times 1$ received vector, and $\mathbf{w}$ is
the complex additive white Gaussian noise with variance $\sigma_w^2$. Suppose that
$E[\mathbf{s}\mathbf{s}^H] = \mathbf{I}_N$ and $E[\mathbf{w}\mathbf{w}^H] = \sigma_w^2 \mathbf{I}_M$. We assume that the channel
matrix $\mathbf{H}$ is time-invariant during a block that is longer than a symbol period and
changes independently from block to block, and that it is known at the receiver but unknown at the
transmitter.
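As a quick illustration of Eq. (1), the following Python sketch draws one channel realization and one received vector. The QPSK mapping, the default noise variance, and the function name `mimo_channel` are illustrative assumptions and are not part of the system description above.

```python
import numpy as np

def mimo_channel(M=4, N=4, sigma_w2=0.1, rng=None):
    """Draw H, s, w and return them together with y = H s + w (Eq. (1))."""
    rng = np.random.default_rng() if rng is None else rng
    # i.i.d. complex Gaussian entries with zero mean and unit variance
    H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    # Assumed QPSK mapping with unit symbol energy, so that E[s s^H] = I_N
    bits = rng.integers(0, 2, size=(N, 2))
    s = ((1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])) / np.sqrt(2)
    # Complex AWGN with variance sigma_w2 per receive antenna
    w = np.sqrt(sigma_w2 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    return H, s, w, H @ s + w
```

Passing a seeded generator, e.g. `mimo_channel(rng=np.random.default_rng(0))`, makes the realization reproducible.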


Fig.  1 .  Block  diagram  of  LR-aided  IDD  receiver  for  coded  MIMO  systems 


At the receiver, the LR-aided IDD structure is adopted to exchange extrinsic information between
the SISO detector and the SISO decoder. The extrinsic information $L_{E,1}$ is first calculated by the
SISO detector based on the observation $\mathbf{y}$, the channel $\mathbf{H}$, and the prior information $L_{A,1}$, which
is fed back by the SISO decoder. Then, the extrinsic information from the detector is passed
through the interleaver to the SISO decoder, which takes it as prior information $L_{A,2}$ to obtain
the information bits and calculate new extrinsic information $L_{E,2}$ to feed back to the detector.
Thus, the receiver iterates between detection and decoding.

III.  Key  Algorithms  in  LR-aided  IDD  receiver 
A.  Lattice  Reduction 

In the MIMO transmission model in Eq. (1), the received signal vector $\mathbf{y}$ is a noisy
observation of the vector $\mathbf{H}\mathbf{s}$, which lies in the lattice spanned by the columns of $\mathbf{H}$ since
all the entries of $\mathbf{s}$ can be transformed to complex integers by shifting and scaling. In general,
a lattice has more than one set of basis vectors.
There exist bases that span the same lattice as $\mathbf{H}$ but are closer to orthogonality than
$\mathbf{H}$. The process of finding a basis closer to orthogonality is called LR. Finding an
optimal basis (closest to orthogonality) of a lattice is computationally expensive. Thus,
the practical goal of LR algorithms is to find a "better" channel matrix $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, where $\mathbf{T}$ is a
unimodular matrix, which means that all the entries of $\mathbf{T}$ and $\mathbf{T}^{-1}$ are complex integers and the
determinant of $\mathbf{T}$ is $\pm 1$ or $\pm j$. These restrictions on $\mathbf{T}$ ensure that the lattice generated
by $\tilde{\mathbf{H}}$ is the same as that generated by $\mathbf{H}$.

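Generally, LR techniques preprocess $\mathbf{H}$ to produce a reduced-lattice basis $\tilde{\mathbf{H}} =
\mathbf{H}\mathbf{T}$. This factorization allows us to rewrite the system in Eq. (1) as

$$\mathbf{y} = \mathbf{H}\mathbf{T}(\mathbf{T}^{-1}\mathbf{s}) + \mathbf{w} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{w}, \qquad (2)$$

where $\mathbf{z} = \mathbf{T}^{-1}\mathbf{s}$.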

Here we adopt the complex LLL (CLLL) algorithm [8], [11] to perform LR on the channel
matrix H. The pseudo-code of the CLLL algorithm is summarized in
Fig. 2 [8].


INPUT: H; OUTPUT: Q, R, T

(1)  [Q, R] = QR Decomposition(H);
(2)  δ ∈ (1/2, 1);
(3)  m = size(H, 2);
(4)  T = I_m;
(5)  k = 2;
(6)  while k <= m
(7)      for n = k-1 : -1 : 1
(8)          u = round(R(n,k) / R(n,n));
(9)          if u ~= 0
(10)             R(1:n, k) = R(1:n, k) - u * R(1:n, n);
(11)             T(:, k) = T(:, k) - u * T(:, n);
(12)         end
(13)     end
(14)     if δ |R(k-1,k-1)|^2 > |R(k,k)|^2 + |R(k-1,k)|^2
(15)         Swap the (k-1)th and kth columns in R and T
(16)         Θ = [α*, β; -β, α], where α = R(k-1,k-1) / ||R(k-1:k, k-1)||
                                  and β = R(k,k-1) / ||R(k-1:k, k-1)||;
(17)         R(k-1:k, k-1:m) = Θ R(k-1:k, k-1:m);
(18)         Q(:, k-1:k) = Q(:, k-1:k) Θ^H;
(19)         k = max(k-1, 2);
(20)     else
(21)         k = k + 1;
(22)     end
(23) end


Fig.  2.  Pseudo-code  of  CLLL  algorithm 
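The steps of Fig. 2 can be sketched in floating-point Python as follows. This is a minimal illustration, not the fixed-point design of [12]: the value delta = 0.75 is an assumed choice inside (1/2, 1), indices are zero-based, and the rotation uses conj(beta) in its first row, which coincides with the form in Fig. 2 whenever beta is real. The function name `clll` is hypothetical.

```python
import numpy as np

def clll(H, delta=0.75):
    """Complex LLL reduction; returns Q, R, T such that H @ T = Q @ R."""
    H = np.asarray(H, dtype=complex)
    Q, R = np.linalg.qr(H)                       # line (1)
    m = H.shape[1]                               # line (3)
    T = np.eye(m, dtype=complex)                 # line (4)
    k = 1                                        # line (5): second column, zero-based
    while k < m:                                 # line (6)
        for n in range(k - 1, -1, -1):           # line (7): size reduction
            ratio = R[n, k] / R[n, n]
            # Complex rounding: real and imaginary parts rounded separately
            u = np.rint(ratio.real) + 1j * np.rint(ratio.imag)   # line (8)
            if u != 0:
                R[: n + 1, k] -= u * R[: n + 1, n]               # line (10)
                T[:, k] -= u * T[:, n]                           # line (11)
        lhs = delta * abs(R[k - 1, k - 1]) ** 2
        rhs = abs(R[k, k]) ** 2 + abs(R[k - 1, k]) ** 2
        if lhs > rhs:                            # line (14): Lovasz condition violated
            R[:, [k - 1, k]] = R[:, [k, k - 1]]  # line (15): swap columns
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            norm = np.linalg.norm(R[k - 1 : k + 1, k - 1])
            alpha = R[k - 1, k - 1] / norm
            beta = R[k, k - 1] / norm
            theta = np.array([[np.conj(alpha), np.conj(beta)],
                              [-beta, alpha]])   # line (16): restores triangularity
            R[k - 1 : k + 1, k - 1 :] = theta @ R[k - 1 : k + 1, k - 1 :]   # line (17)
            Q[:, k - 1 : k + 1] = Q[:, k - 1 : k + 1] @ theta.conj().T      # line (18)
            k = max(k - 1, 1)                    # line (19)
        else:
            k += 1                               # line (21)
    return Q, R, T
```

A quick sanity check is that `H @ T` and `Q @ R` agree, that `T` has Gaussian-integer entries, and that `abs(det(T))` equals 1.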
For the CLLL algorithm, the main computation parts are the QR decomposition (Line 1
in Fig. 2), the size reduction (Lines 7-13 in Fig. 2), the complex Lovász condition (Line 14
in Fig. 2), and the basis update (Lines 16-18 in Fig. 2). To facilitate the fixed-point
design while maintaining the performance, some modifications are adopted in [12]:
a relaxed size reduction condition is defined for the size reduction
part, the complex Lovász condition is replaced by the Siegel condition, the integer-rounded
division (Line 8 in Fig. 2) is implemented with a single Newton-Raphson (NR) iteration,
and the calculation of Θ (Line 16 in Fig. 2) is completed by the Householder CORDIC
algorithm.

B.  List  MIMO  detector 

For the list MIMO detector in the LR-aided IDD receiver, the authors in [10] proposed three
methods, i.e., the fixed radius algorithm (FRA), the fixed candidates algorithm (FCA), and the
fixed memory-usage algorithm (FMA). FRA, as well as FMA, is a combination of sphere decoding [13] and LR,
which searches all possible candidates in the sphere. In this case the number of candidates is
random, which complicates hardware implementation. FCA is a combination of the K-best
algorithm [14] and LR, which applies an element-by-element search with a fixed number
of points on each layer, so it is suitable for hardware implementation.

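For LR-aided linear hard detectors, LR is first applied to the channel matrix $\mathbf{H}$, followed by
linear equalization based on the reduced-lattice basis $\tilde{\mathbf{H}}$. For example, when the zero forcing
(ZF) equalizer is adopted, we get

$$\mathbf{x} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \mathbf{T}^{-1}\mathbf{s} + \tilde{\mathbf{H}}^{\dagger}\mathbf{w} = \mathbf{z} + \mathbf{n}. \qquad (3)$$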

Then we need to obtain an estimate of $\mathbf{z}$ in Eq. (3), after which $\mathbf{s}$ is estimated through a one-to-one
mapping, which implies that we need a candidate list of $\mathbf{z}$ in the list MIMO detector. Different
from the SD method in [13], here the sphere is built in the $\mathbf{z}$-domain centered at the LR-aided
estimate instead of the $\mathbf{s}$-domain centered at the ZF estimate or another estimate from preprocessing.
However, because of the matrix $\mathbf{T}$, the constellation of $\mathbf{z}$ is not readily available. Some candidates $\mathbf{z}$ on the integer
lattice may not generate valid candidates in the $\mathbf{s}$-domain. One way is to find all possible $\mathbf{z}$'s and
then perform the search, which has high computational complexity. Since our final goal is to
obtain $\mathbf{s}$, not $\mathbf{z}$, and the alphabet of $\mathbf{s}$ is known, we instead find the list of candidates on $\mathbf{s}$,
$C_s$, as:

$$C_s = \{\mathbf{s} : \|\mathbf{T}^{-1}\mathbf{s} - \mathbf{x}\|^2 < r_z\}. \qquad (4)$$

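To further reduce the complexity, we can apply QR decomposition to $\mathbf{T}^{-1}$ so that $\mathbf{T}^{-1} =
\mathbf{Q}_T\mathbf{R}_T$, and then we obtain

$$\|\mathbf{T}^{-1}\mathbf{s} - \mathbf{x}\|^2 = \|\mathbf{Q}_T^H\mathbf{x} - \mathbf{R}_T\mathbf{s}\|^2. \qquad (5)$$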

Here low-complexity tree-searching methods can be performed by starting from the bottom
layer. To facilitate the hardware implementation, we select the FCA as the list MIMO
detector in the LR-aided IDD receiver for fixed-point design because its breadth-first tree-search
method has a fixed throughput like the K-best method. Furthermore, FCA always includes the
LR-aided hard decision in the candidate list to guarantee diversity. The pseudo-code of the
FCA algorithm is summarized in Fig. 3 [10].


Input: y, H, Kp; Output: Cs

Initialize: Dist = zeros(1, Kp), and Cs = ∅
    (Dist records the distance between R_T s and Q_T^H x, for s ∈ Cs)

[H~, T] = CLLL(H);
Hard-decision solution: s_hd;
[Q_T, R_T] = QR decomposition(T^{-1});
x = H~^† y;
q = Q_T^H x;
For n = N : (-1) : 1
    For each partial candidate vector s_i ∈ Cs, i ∈ [1, Kp]
        For each symbol u_l ∈ S, l ∈ [1, |S|], except the one in s_hd
            D_t(i, l) = Dist(i) + |q(n) - R_T(n, n:N) [u_l; s_i]|^2;
        end
    end
    Find the Kp minimum values in D_t and save them as Dist;
    Save the corresponding vectors [u_l; s_i] as Cs;
end


Fig.  3.  Pseudo-code  of  List  MIMO  detector  with  FCA 
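The breadth-first search at the core of Fig. 3 can be sketched as below. This is a simplified illustration under stated assumptions: it takes q and the upper triangular R_T as given, keeps the Kp best partial candidates per layer, and omits the separate handling of the LR-aided hard decision s_hd, which FCA always keeps in the list. The function name `k_best_list` and its return format are hypothetical.

```python
import numpy as np

def k_best_list(q, R, symbols, Kp):
    """Breadth-first K-best search for small ||q - R s||^2 over s in symbols^N.

    Returns a list of (candidate, distance) pairs sorted by distance, where each
    candidate is a tuple ordered from the first layer to the last layer.
    """
    N = R.shape[0]
    cands = [((), 0.0)]                      # (partial vector for layers n..N-1, distance)
    for n in range(N - 1, -1, -1):           # start from the bottom layer
        expanded = []
        for tail, dist in cands:
            for u in symbols:
                vec = (u,) + tail            # extend the partial candidate by one layer
                est = np.dot(R[n, n:], np.array(vec))
                expanded.append((vec, dist + abs(q[n] - est) ** 2))
        expanded.sort(key=lambda c: c[1])    # keep the Kp smallest accumulated distances
        cands = expanded[:Kp]
    return cands
```

Because exactly Kp partial candidates survive each layer, the work per layer is fixed, which is the property that makes this style of search attractive for hardware.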


C.  QR  Decomposition 

QR decomposition (QRD) is an essential component of both the CLLL and FCA
algorithms described above. The QRD transforms a matrix H into a unitary matrix Q and an upper triangular
matrix R, i.e., H = QR. Three well-known algorithms have been proposed to perform QRD:
the Gram-Schmidt (GS) algorithm, the Householder transformation (HT), and the Givens rotation (GR).

In [15], [16], it has been shown that GR can be efficiently implemented through the Coordinate
Rotation Digital Computer (CORDIC) and Triangular Systolic Array (TSA) algorithms. With the
CORDIC algorithm, GR does not require norm and division operations, and with the TSA algorithm
it can easily exploit parallelism when processing a large matrix. Furthermore, GR demonstrates
higher numerical stability in VLSI implementation of the QRD process compared with the GS
and HT methods. For these reasons, we select the GR method as the QRD algorithm in the
LR-aided IDD receiver.

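The QRD process under the GR algorithm with TSA and CORDIC [15] can be illustrated on a
2 × 2 complex matrix H:

$$\mathbf{H} = \begin{bmatrix} A e^{j\theta_a} & C e^{j\theta_c} \\ B e^{j\theta_b} & D e^{j\theta_d} \end{bmatrix}, \qquad (6)$$

where $j = \sqrt{-1}$, $A$, $B$, $C$, $D$ represent the magnitudes, and $\theta_a$, $\theta_b$, $\theta_c$, $\theta_d$ stand for the angles of
the matrix entries. To obtain the QRD of $\mathbf{H}$, $\mathbf{H}$ is first transformed by the
unitary matrix $\mathbf{Q}_1$, expressed by:

$$\mathbf{Q}_1 = \begin{bmatrix} \cos\theta_1 e^{j\theta_2} & \sin\theta_1 e^{j\theta_3} \\ -\sin\theta_1 e^{j\theta_2} & \cos\theta_1 e^{j\theta_3} \end{bmatrix}, \qquad (7)$$

where the three angles $\theta_1$, $\theta_2$, $\theta_3$ are calculated as follows:

$$\theta_1 = \tan^{-1}(B/A), \quad \theta_2 = -\theta_a, \quad \theta_3 = -\theta_b. \qquad (8)$$

After the above transformation, we get an upper triangular matrix $\mathbf{R}_1$:

$$\mathbf{R}_1 = \mathbf{Q}_1\mathbf{H} = \begin{bmatrix} \cos\theta_1 e^{j\theta_2} & \sin\theta_1 e^{j\theta_3} \\ -\sin\theta_1 e^{j\theta_2} & \cos\theta_1 e^{j\theta_3} \end{bmatrix} \begin{bmatrix} A e^{j\theta_a} & C e^{j\theta_c} \\ B e^{j\theta_b} & D e^{j\theta_d} \end{bmatrix} = \begin{bmatrix} X & Y e^{j\theta_y} \\ 0 & Z e^{j\theta_z} \end{bmatrix}. \qquad (9)$$

Next, $\mathbf{R}_1$ is transformed by another simple unitary matrix $\mathbf{Q}_2$, expressed by:

$$\mathbf{Q}_2 = \begin{bmatrix} 1 & 0 \\ 0 & e^{-j\theta_z} \end{bmatrix}, \qquad (10)$$

so that we get the final $\mathbf{R}$ matrix of the QRD process:

$$\mathbf{R} = \mathbf{Q}_2\mathbf{R}_1 = \begin{bmatrix} 1 & 0 \\ 0 & e^{-j\theta_z} \end{bmatrix} \begin{bmatrix} X & Y e^{j\theta_y} \\ 0 & Z e^{j\theta_z} \end{bmatrix} = \begin{bmatrix} X & Y e^{j\theta_y} \\ 0 & Z \end{bmatrix}. \qquad (11)$$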
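The 2 × 2 procedure above can be checked numerically with the short Python sketch below. It uses floating-point trigonometric functions rather than CORDIC iterations, and it assumes that the first rotation angle is chosen to null the lower-left entry, i.e. the arctangent of the ratio of the two first-column magnitudes. The function name `qrd_2x2` is hypothetical.

```python
import numpy as np

def qrd_2x2(H):
    """QRD of a 2x2 complex matrix by one rotation and one phase correction."""
    a, b = H[0, 0], H[1, 0]
    theta1 = np.arctan2(abs(b), abs(a))      # rotation that nulls the (2,1) entry
    theta2 = -np.angle(a)                    # removes the phase of the (1,1) entry
    theta3 = -np.angle(b)                    # removes the phase of the (2,1) entry
    c, s = np.cos(theta1), np.sin(theta1)
    Q1 = np.array([[c * np.exp(1j * theta2), s * np.exp(1j * theta3)],
                   [-s * np.exp(1j * theta2), c * np.exp(1j * theta3)]])
    R1 = Q1 @ H                              # upper triangular, (2,2) entry still complex
    theta_z = np.angle(R1[1, 1])
    Q2 = np.diag([1.0, np.exp(-1j * theta_z)])
    R = Q2 @ R1                              # diagonal is now real and non-negative
    Q = (Q2 @ Q1).conj().T                   # so that H = Q R
    return Q, R
```

Since only rotations and phase shifts are applied, every step maps naturally onto CORDIC vectoring and rotation operations.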


Based  on  the  above  procedure,  for  a  4  x  4  matrix  H,  the  QRD  can  be  implemented  through 
the  CORDIC-based  systolic  array  as  depicted  in  Fig.  4. 


[Figure 4: the rows of the 4 × 4 matrix H (entries h11 to h44), each accompanied by a one-hot
row of 0/1 entries, enter an array of DU and PE cells; the array outputs the entries q11 to q44
and the upper triangular entries r11 to r44.]

Fig. 4. Architecture diagram of the CORDIC-based triangular systolic array for QR decomposition.


Three different types of cells are shown in Fig. 4: the delay unit (DU), the processing element (PE),
and the rotational unit (RU) [15]. The DU delays the incoming data by the number of clock cycles that
the neighboring cell takes to process the data, then delivers it to the PE when it is available. The PE, as the
most complex unit, can operate in either vectoring mode or rotation mode. In vectoring mode,
the PE calculates the three angles described in (8), stores them in the cell memory, and meanwhile
computes the norm of the complex vector. The computed norm is passed to the east of the cell
with a flag that requests the next PE to operate in vectoring mode. In rotation mode, the PE rotates
the incoming complex vector by the angles stored in the cell memory, and passes the results
from the north to the south port and from the west to the east port. Fig. 5 depicts the structure of the PE in both modes
with data flows. The RU has the same operation modes as the PE, but operates in vectoring
mode only when a diagonal element of a channel matrix enters from the north port.


Fig.  5.  PE  cell  structure  in  (i)  vectoring  mode  and  (ii)  rotation  mode  (a/b  denotes  the  data  that  is  coming  into  the  cell  from 
west/north  entrance). 


D.  LLR  Computing  between  the  detector  and  the  decoder 

The extrinsic information $L_{E,1}$ shown in Fig. 1 is usually expressed by the log-likelihood ratio
(LLR) of each transmitted bit as follows [10]:

$$L_{E,1}(c_i|\mathbf{y}) \approx \frac{1}{2}\max_{\mathbf{c} \in C_s \cap S_{i,+1}} \left\{ -\frac{1}{\sigma_w^2}\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 + \mathbf{c}^T\mathbf{L}_{A,1} - L_{A,1}(c_i) \right\} - \frac{1}{2}\max_{\mathbf{c} \in C_s \cap S_{i,-1}} \left\{ -\frac{1}{\sigma_w^2}\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 + \mathbf{c}^T\mathbf{L}_{A,1} + L_{A,1}(c_i) \right\}, \qquad (12)$$

where $C_s$ denotes the candidate list from the list MIMO detector in the LR-aided IDD receiver,
$S_{i,+1}$ represents the subset of $C_s$ with the $i$th bit equal to $+1$, and $S_{i,-1}$ is similarly defined, so that
$C_s = S_{i,+1} \cup S_{i,-1}$.

Both the complexity and the performance of the list MIMO detector depend on the size of the
candidate list $C_s$. If the list is too long, the results approach the optimal
MAP, but the complexity becomes too high (close to exhaustive search). On the other hand, if the list
is too short, the performance is degraded due to inaccurate $L_{E,1}$ values. Furthermore,
the error of $L_{E,1}$ is especially large when the output list $C_s$ includes only candidates
with $c_i = +1$ or only candidates with $c_i = -1$, which may result in very large values in Eq. (12) that
prevent the decoder from correcting the falsely detected data.
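The list-based LLR computation can be sketched as follows. The sketch assumes the common max-log form with a factor 1/2 on each max term, bits represented as ±1, and an explicit list of candidate symbol vectors with their bit labels; the function name `max_log_llr` and its argument layout are hypothetical. When all candidates agree on a bit, one of the two subsets is empty and the sketch returns an infinite LLR, which is the situation described in the paragraph above.

```python
import numpy as np

def max_log_llr(y, H, cand_syms, cand_bits, L_A, sigma_w2):
    """Max-log extrinsic LLRs from a candidate list.

    cand_syms: (num_cands, N) candidate symbol vectors s
    cand_bits: (num_cands, num_bits) bit labels c in {+1, -1} for each candidate
    L_A:       (num_bits,) a priori LLRs from the decoder
    """
    cand_syms = np.asarray(cand_syms)
    cand_bits = np.asarray(cand_bits, dtype=float)
    L_A = np.asarray(L_A, dtype=float)
    # Squared Euclidean distance ||y - H s||^2 for every candidate
    dist = np.sum(np.abs(y[None, :] - cand_syms @ H.T) ** 2, axis=1)
    common = -dist / sigma_w2 + cand_bits @ L_A
    llr = np.empty(cand_bits.shape[1])
    for i in range(cand_bits.shape[1]):
        plus = cand_bits[:, i] > 0
        # An empty subset contributes -inf, giving an unbounded LLR
        m_plus = np.max(common[plus] - L_A[i]) if plus.any() else -np.inf
        m_minus = np.max(common[~plus] + L_A[i]) if (~plus).any() else -np.inf
        llr[i] = 0.5 * (m_plus - m_minus)
    return llr
```

For a single BPSK-like symbol with y = 1, H = 1, unit noise variance, and no prior information, the two candidates +1 and -1 give an LLR of 0.5 · (0 − (−4)) = 2.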

The undesirable effect of a small candidate list in the list MIMO detector can be reduced by
LLR clipping [4], which limits the dynamic range of the LLR values so that the decoder can still
overcome erroneous data from the detector. The LLR clipping is defined as follows:

$$L_{E,1}^{\mathrm{clip}}(c_i|\mathbf{y}) = \begin{cases} L_{E,1}(c_i|\mathbf{y}), & |L_{E,1}(c_i|\mathbf{y})| \le L_{\max}, \\ \mathrm{sign}(L_{E,1}(c_i|\mathbf{y})) \cdot L_{\max}, & |L_{E,1}(c_i|\mathbf{y})| > L_{\max}, \end{cases} \qquad (13)$$

where $L_{E,1}^{\mathrm{clip}}(c_i|\mathbf{y})$ is the clipped LLR and $L_{\max}$ is the predefined maximum LLR magnitude
for $L_{E,1}$. Besides improving the performance of the list MIMO detector, LLR clipping also
reduces the wordlength of the fixed-point design and decreases the complexity of the hardware
implementation.
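In code, the clipping rule amounts to a saturation of each LLR to the interval from -Lmax to +Lmax. The sketch below uses NumPy; the function name `clip_llr` and the default threshold of 8 are illustrative assumptions.

```python
import numpy as np

def clip_llr(llr, l_max=8.0):
    """Limit each LLR to [-l_max, +l_max]; infinite values saturate as well."""
    return np.clip(np.asarray(llr, dtype=float), -l_max, l_max)
```

For example, `clip_llr([-np.inf, -3.0, 9.5])` yields `[-8.0, -3.0, 8.0]`, so an unbounded LLR caused by a one-sided candidate list becomes a finite value the decoder can still override.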


E.  Turbo  Decoding 

The Turbo decoder contains two elementary MAP decoders serially interconnected by
interleavers ($\pi$) and deinterleavers ($\pi^{-1}$), as shown in Fig. 6.

Each elementary decoder has three inputs: the systematic bit ($y_{ks}$), the parity bits from the
component encoder ($y_{kp_1}$ or $y_{kp_2}$), and the extrinsic information from the other component
decoder ($L(u_k)$), also known as the a priori information of the systematic bit. During Turbo
decoding, the component decoders iteratively exchange the probabilities of each information bit,
represented by LLRs, which refines the LLRs of the information bits and improves
the decoding accuracy.

For the fixed-point implementation, here we adopt the well-known Max-Log-MAP algorithm,
which has nearly the same performance as the optimal MAP algorithm with much lower
complexity [17]. For the Max-Log-MAP algorithm, the calculation process of each constituent
decoder can be summarized in the following parts:

Fig. 6. Architecture diagram of Turbo decoder


1. Branch Metric computing (BM)

$$\gamma_k(s', s) = \frac{1}{2} u_k L(u_k) + \frac{L_c}{2} \sum_{l=0}^{N-1} x_{kl} y_{kl}. \qquad (14)$$

2. Forward Recursion computing (FW)

$$\alpha_k(s) = \max_{s'} \{\alpha_{k-1}(s') + \gamma_k(s', s)\}, \quad k = 0, 1, \ldots, N-1, \qquad (15)$$

where $\alpha_{k=0}(s = 0) = 0$ and $\alpha_{k=0}(s \neq 0) = -\infty$.

3. Backward Recursion computing (BW)

$$\beta_{k-1}(s') = \max_{s} \{\beta_k(s) + \gamma_k(s', s)\}, \quad k = N, N-1, \ldots, 1, \qquad (16)$$

where $\beta_{k=N}(s = 0) = 0$ and $\beta_{k=N}(s \neq 0) = -\infty$.

4. LLR computing

$$L(u_k|\mathbf{y}) = \max_{(s', s) \Rightarrow u_k = +1} \{\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\} - \max_{(s', s) \Rightarrow u_k = -1} \{\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\}. \qquad (17)$$

5. Extrinsic information computing

$$L_e(u_k) = L(u_k|\mathbf{y}) - L_c y_{ks} - L(u_k). \qquad (18)$$

In Eqs. (14)-(18), $u_k$ is the information bit that produces the transition from state $s'$
to state $s$ in the Turbo trellis, $L(u_k)$ is the a priori information, and $x_{kl}$ and $y_{kl}$ are the expected
transmitted symbols and the actual received symbols, respectively. $L_c$ is the channel reliability
defined as $L_c = 2/\sigma^2$, where $\sigma^2$ is the average noise power.
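The forward recursion, backward recursion, and LLR steps can be sketched for a generic trellis as follows. The sketch assumes the branch metrics are already available as an array `gamma[k, s_prev, s]` with `-inf` marking invalid transitions, and that `plus_mask[s_prev, s]` flags the transitions driven by an information bit of +1; both layouts, the `terminated` option, and the function name `max_log_map_llr` are hypothetical. Array index `k` of `alpha` holds the forward metric before trellis step `k`.

```python
import numpy as np

def max_log_map_llr(gamma, plus_mask, terminated=True):
    """Max-Log-MAP LLRs from branch metrics gamma[k, s_prev, s]."""
    K, S, _ = gamma.shape
    alpha = np.full((K + 1, S), -np.inf)
    alpha[0, 0] = 0.0                          # trellis starts in state 0
    beta = np.full((K + 1, S), -np.inf)
    if terminated:
        beta[K, 0] = 0.0                       # trellis ends in state 0
    else:
        beta[K, :] = 0.0                       # unknown end state (assumption)
    for k in range(K):                         # forward recursion: max over s_prev
        alpha[k + 1] = np.max(alpha[k][:, None] + gamma[k], axis=0)
    for k in range(K - 1, -1, -1):             # backward recursion: max over s
        beta[k] = np.max(gamma[k] + beta[k + 1][None, :], axis=1)
    llr = np.empty(K)
    for k in range(K):
        total = alpha[k][:, None] + gamma[k] + beta[k + 1][None, :]
        llr[k] = np.max(total[plus_mask]) - np.max(total[~plus_mask])
    return llr
```

Only additions, comparisons, and maxima appear, which is why this algorithm suits short fixed-point wordlengths.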

IV. Fixed-point design for LR-aided IDD Receiver

In this section, the fixed-point design for the whole LR-aided IDD receiver is analyzed
and determined based on the algorithms of the previous section. For the fixed-point simulation, let
FP(iwl, fwl) be the finite representation of a wl-bit two's complement number, where fwl
is the fractional wordlength and iwl is the integer wordlength including a sign bit, so wl =
iwl + fwl. To compare the practical fixed-point performance under different wordlengths
with the ideal floating-point performance, all the simulations are based on the same
system parameters given in the following paragraph.
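A software model of the FP(iwl, fwl) format can be sketched as below. Round-to-nearest quantization and saturation on overflow are assumptions of this sketch (truncation and wrap-around are equally possible hardware choices), and the function name `quantize_fp` is hypothetical.

```python
import numpy as np

def quantize_fp(x, iwl, fwl):
    """Quantize real values to a two's complement FP(iwl, fwl) grid.

    iwl counts the integer bits including the sign bit, fwl the fractional bits.
    """
    scale = 2.0 ** fwl
    lo = -(2.0 ** (iwl - 1))                   # most negative representable value
    hi = 2.0 ** (iwl - 1) - 1.0 / scale        # most positive representable value
    q = np.round(np.asarray(x, dtype=float) * scale) / scale
    return np.clip(q, lo, hi)                  # saturate instead of wrapping
```

For FP(4, 2), the representable range is -8 to 7.75 in steps of 0.25, so 0.3 maps to 0.25, 7.9 saturates to 7.75, and -9 saturates to -8.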

In this paper, the LR-aided IDD receiver is applied in the i.i.d. Rayleigh fading channel with
M = N = 4, i.e., a spatial multiplexing MIMO system with 4 transmit antennas and 4 receive
antennas. The channel is time-invariant for one symbol period and changes independently from
symbol to symbol. The modulation scheme is QPSK and the SNR is defined as symbol energy
versus noise power, i.e., $E[|s|^2]/\sigma_w^2$. For the FCA algorithm in the list MIMO detector, Kp is
set to 2, not counting the candidate from LR-aided hard detection. Simulation results show that the
number of candidates in the FCA list is approximately 3. For the ECC, the parallel rate 1/2 Turbo code
is  adopted  with  the  generator(l,  1l0//)2j.  The  information  bit  sequence  is  of  length  1024.  For 
each  information  sequence,  we  perform  up  to  4  IDD  iterations  and  up  to  8  iterations  within  the 
turbo  decoder  as  suggested  in  [4]. 

A.  LLR  clipping  between  the  detector  and  the  decoder 

To study the LLR clipping effect and to find the optimal clipping threshold of the LR-aided
IDD receiver, we examined the BER performance under different clipping values as shown in
Fig. 7. The simulation results demonstrate that the performance of the system can be clearly
improved by applying a proper LLR clipping threshold. On the other hand, clipping values that are
either too large or too small degrade the system performance.

Fig. 7. BER performance under different LLR clipping threshold


Based on the simulation, an LLR clipping threshold of Lmax = 8 is shown to be an appropriate
and simple choice, which is also consistent with the results in [18] and [4].

We also examined the effect of the number of IDD and Turbo decoding iterations on the system
performance under the selected clipping threshold Lmax = 8. The simulation results
are shown in Fig. 8. As the number of iterations increases, the performance
improves, but the improvement becomes marginal as the number of iterations keeps
growing. From Fig. 8, the performance of the system with 3 IDD
iterations and 4 Turbo decoding iterations is very close to that of the system with 4
IDD iterations and 8 Turbo decoding iterations. In a hardware implementation, low complexity
and low delay are desirable to limit cost. In the following parts, we therefore only investigate
the LR-aided IDD receiver with 3 IDD iterations and 4 Turbo decoding iterations.

B. Fixed-point design for the List MIMO detector

The QRD part of the list MIMO detector appears in both the CLLL and FCA algorithms, where
QRD is applied to the channel matrix $\mathbf{H}$ and the unimodular matrix $\mathbf{T}^{-1}$.

Fig. 8. BER performance under different iteration in IDD and Turbo decoding


For the i.i.d. Rayleigh fading channel $\mathbf{H}$ with variance of one in our system configuration,
the probability that its element energy exceeds 16 is approximately $6.4 \times 10^{-58}$, which one can
practically ignore. Therefore 5 bits are enough to represent the integer part of the elements of $\mathbf{H}$. During
the QRD processing, since the angles in the transforming unitary matrix are contained within
$[-\pi, \pi]$, 4 bits are sufficient to represent the integer part of the angles. For the fractional bit width of
both the $\mathbf{H}$ data and the angle data, we examined the accuracy of the QRD under different fractional
bit widths as shown in Fig. 9, where the accuracy is defined as the Frobenius norm of the difference
between the channel matrix $\mathbf{H}$ and the product of $\mathbf{Q}$ and $\mathbf{R}$, i.e.,

$$\mathrm{Accuracy(QR\ model)} = \|\mathbf{H} - \mathbf{Q}\mathbf{R}\|_F. \qquad (19)$$

Fig. 9 shows that 16 bits are enough for the fractional wordlength in QRD since in this case
both data and angles achieve an accuracy within 0.14%. In sum, FP(5, 16) and FP(4, 16)
are suitable for the data and the angles in the QRD module, respectively.
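The quoted overflow probability can be reproduced with a one-line tail computation. The sketch below assumes that the figure corresponds to the one-sided tail of a zero-mean, unit-variance real Gaussian variable exceeding 16, the upper limit of a 5-bit two's complement integer part; this interpretation is an assumption, and the function name `gaussian_tail` is hypothetical.

```python
import math

def gaussian_tail(x):
    """Probability that a standard normal variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Overflow probability for an integer part covering [-16, 16)
p_overflow = gaussian_tail(16.0)
```

Under this assumption `p_overflow` evaluates to roughly 6.4e-58, matching the value stated above.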

For the QRD of the unimodular matrix $\mathbf{T}^{-1}$, the angles have the same properties as in the
QRD of $\mathbf{H}$, so the same FP(4, 16) is adopted. For the data part, because all the entries are
Gaussian integers, we can reduce the fractional wordlength and increase the integer wordlength
while keeping the total wordlength unchanged. Here we adopt FP(13, 8) for the data, with which
the performance of the system is almost the same as that of the floating-point system in the
following simulation. Besides, because the total wordlength is identical to that used for $\mathbf{H}$, the
same QRD hardware implementation can be used for both the unimodular matrix $\mathbf{T}^{-1}$ and the channel
matrix $\mathbf{H}$.

Fig. 9. QR-decomposition Triangular Systolic Array fixed-point model accuracy

For the CLLL part, the fixed-point design mainly follows our former work in [12]. The
fixed-point representations of some key parameters in CLLL are as follows: the integer parts of
u, T, and the internal datapath of the Householder CORDIC use 11 bits, 9 bits, and 5 bits, respectively;
the fractional parts of both Q and R use 13 bits; and the integer part of R after size reduction and
basis updating needs 5 bits at most.

When the fixed-point design from the above analysis is applied only to the list MIMO detector
in the LR-aided IDD receiver, its performance compared with the floating-point system under
LLR clipping is depicted in Fig. 10. The results show that the BER performance degradation
of the fixed-point design for the list MIMO detector is kept below 0.2 dB.

Fig. 10. BER performance under fixed-point MIMO detector in LR-aided IDD receiver


C.  Fixed-point  design  for  the  Turbo  decoder 

Fixed-point design for Turbo decoding has been well studied in the literature [19], [20], [21],
and [22]. The most important parts of the fixed-point implementation of Turbo decoding are
the BM, FW, and BW parts described in Section III-E. The fixed-point implementation in this
paper is mainly based on the results in [22]. The formats we adopt for the BM, FW, and BW
are FP(5, 3), FP(7, 3), and FP(7, 3), respectively, and the format for both the
extrinsic information and the prior information is FP(5, 3).

When the fixed-point design from the above analysis is applied only to the Turbo decoder in
the LR-aided IDD receiver, its performance compared with the floating-point system
under LLR clipping is depicted in Fig. 11. The results show that the BER performance
degradation of the fixed-point design for the Turbo decoder is kept within 0.1 dB.

D.  Fixed-point  performance  of  the  whole  LR-aided  IDD  receiver 

Based on the above finite wordlength analysis for the MIMO detector and the Turbo decoder,
and by adding the fixed-point design for the LLR information between the detector and the decoder,
we obtain the fixed-point performance of the whole LR-aided IDD receiver for coded MIMO
systems. Because the clipping threshold is set to 8 based on the earlier simulation,
the fixed-point design of the LLR values can adopt a 4-bit integer wordlength with saturation.
For the fractional wordlength, our simulations show that even 2 bits maintain a desirable
performance, which reduces the complexity of the VLSI implementation.

Fig. 11. BER performance under fixed-point Turbo decoder in LR-aided IDD receiver

With the FP(4, 2) fixed-point representation for the LLR information between the MIMO
detector and the decoder, and with all the fixed-point designs from the preceding analysis applied
to the LR-aided IDD receiver, the performance compared with the floating-point system under
LLR clipping is depicted in Fig. 12. The results show that the BER performance degradation
of the fixed-point design for the whole LR-aided IDD receiver is kept below 0.2 dB.

V.  Conclusion 

In this paper we have demonstrated a fixed-point implementation of the whole LR-aided IDD
receiver for coded MIMO systems, which includes the fixed-point design of the key algorithms
such as CLLL, FCA, QRD, LLR clipping, and Turbo decoding. The results of the fixed-point system
show that its BER performance degradation is within 0.2 dB compared with the floating-point
system. These results provide the basis for a hardware implementation of the LR-aided IDD
receiver in VLSI and FPGA, which is our next step.

Fig. 12. BER performance under fixed-point LR-aided IDD receiver